431 Class 11

Thomas E. Love, Ph.D.

2023-10-03

Today’s Agenda

  • Pulling in data for a new example with read_Rds()
  • Exploring a quantity, broken down into > 2 subgroups
    • Visualization gallery: comparison boxplot, faceted histograms, density and ridgeline plots
    • Assessing the assumptions of regression (ANOVA) with residual plots
  • Intro: Using simple (single) imputation to deal with missing data
  • An Example to Work through on your own

Today’s Setup

knitr::opts_chunk$set(comment=NA)
library(broom)               ## tidy, glance and (new) augment!
library(ggridges)            ## help with ridgeline plots
library(kableExtra)          ## tidy up tables of output
library(janitor)
library(mosaic)              ## for favstats
library(naniar)
library(simputation)         ## enable single imputation of NAs
library(patchwork)
library(tidyverse)

theme_set(theme_bw())

Today’s Data

Today, we’ll use an R data set (.Rds) to import data.

bs_dat <- read_rds("c11/data/blood_storage.Rds")
  • This allows us to read in the data just as they were last saved in R, including “factoring”, etc.
    • readRDS() also works but is a little slower.
  • To write an R data set, use write_rds(datasetname, "locationoncomputer").
    • saveRDS() would also work, but slower.

The blood storage data set

This study1 evaluates the association between red blood cells (RBC) storage duration (categorized into three groups) and time (in months) to biochemical prostate cancer recurrence after radical prostatectomy.

In cancer patients, perioperative blood transfusion has long been suspected of reducing long-term survival, and it is suspected that cancer recurrence may be worsened after the transfusion of older blood.

More complete versions of the data (along with more detailed explanations) appear in the Cleveland Clinic’s Statistical Education repository, and in the medicaldata package in R.

Codebook for bs_dat (n = 292)

Variable Description
participant subject identification code
age_group younger, middle or older (RBC age exposure)
units number of allogeneic blood transfusion units received
recur_time time (months) to biochemical recurrence of prostate cancer

Our sample includes participants who received 1-4 units.

What’s in the Data?

bs_dat
# A tibble: 292 × 4
   participant age_group units recur_time
   <chr>       <fct>     <dbl>      <dbl>
 1 102         older         2      47.6 
 2 103         older         1      14.1 
 3 104         middle        2      59.5 
 4 105         middle        3       1.23
 5 106         older         1      74.7 
 6 107         older         2      13.9 
 7 108         younger       4       8.37
 8 109         younger       1      48.6 
 9 110         middle        2      22.6 
10 111         middle        2       4.63
# ℹ 282 more rows

Missing Values?

miss_var_summary(bs_dat)
# A tibble: 4 × 3
  variable    n_miss pct_miss
  <chr>        <int>    <dbl>
1 age_group        2    0.685
2 recur_time       1    0.342
3 participant      0    0    
4 units            0    0    

Outcome is time to recurrence

p1 <- ggplot(bs_dat, aes(sample = recur_time)) +
  geom_qq(col = "dodgerblue") + 
  geom_qq_line(col = "magenta") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q plot: recur_time")

p2 <- ggplot(bs_dat, aes(x = recur_time)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 20, fill = "dodgerblue", col = "cyan") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(bs_dat$recur_time, na.rm = TRUE), 
                            sd = sd(bs_dat$recur_time, na.rm = TRUE)),
                col = "magenta", lwd = 1.5) +
  labs(title = "Density Function: recur_time")

p3 <- ggplot(bs_dat, aes(x = recur_time, y = "")) +
  geom_boxplot(fill = "dodgerblue", notch = TRUE, 
               outlier.color = "dodgerblue") + 
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +
  labs(title = "Boxplot: recur_time", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

Outcome is time to recurrence

Compare recur_time by age_group

We’ll start with a Complete Case Analysis that ignores any case with missing data.

bs_cc <- bs_dat |> filter(complete.cases(age_group, recur_time, units))

favstats(recur_time ~ age_group, data = bs_cc) |>
  kbl(digits = 2) |> 
  kable_styling(font_size = 28, full_width = FALSE)
age_group min Q1 median Q3 max mean sd n missing
younger 0.27 9.28 31.18 52.27 101.7 34.29 29.75 96 0
middle 0.40 6.67 22.44 47.50 103.6 30.67 27.69 98 0
older 0.30 7.68 28.33 54.14 102.2 33.77 28.12 95 0

Scatterplot of recur_time vs. age_group

ggplot(bs_cc, aes(x = age_group, y = recur_time)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)

Visualizing Strategies

We’re trying to look at the impact of age_group on recur_time.

  • Comparison Boxplot
  • Faceted Histograms
  • Overlapping Density Plot
  • Ridgeline Plot

So let’s walk through each of these.

Comparison Boxplot

ggplot(data = bs_cc, aes(x = age_group, y = recur_time)) +
  geom_violin() +
  geom_boxplot(aes(fill = age_group), width = 0.3, 
               notch = TRUE, outlier.size = 2) +
  guides(fill = "none") +
  coord_flip() +
  scale_fill_viridis_d(alpha = 0.5) +
  labs(y = "Recurrence Time (in months)",
       x = "Red Blood Cell age group",
       title = "Recurrence Time by RBC Age Group")

Comparison Boxplot

Add MEANS to Comparison Boxplot

ggplot(data = bs_cc, aes(x = age_group, y = recur_time)) +
  geom_violin() +
  geom_boxplot(aes(fill = age_group), width = 0.3, 
               notch = TRUE, outlier.size = 2) +
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +
  guides(fill = "none") +
  coord_flip() +
  scale_fill_viridis_d(alpha = 0.5) +
  labs(y = "Recurrence Time (in months)",
       x = "Red Blood Cell age group",
       title = "Recurrence Time by RBC Age Group",
       caption = "Diamonds indicate sample means")

Add MEANS to Comparison Boxplot

Faceted Histograms

ggplot(data = bs_cc, aes(x = recur_time, fill = age_group)) +
  geom_histogram(bins = 20, col = "navy") +
  guides(fill = "none") +
  facet_grid(age_group ~ .) +
  labs(x = "Recurrence Time (in months)",
       title = "Recurrence Time by RBC Age Group")

Faceted Histograms

Comparing Densities

ggplot(data = bs_cc, aes(x = recur_time, fill = age_group)) +
  geom_density() + scale_fill_viridis_d(alpha = 0.5, option = "A") + 
  labs(title = "Time to Recurrence, by RBC Age Group")

Using a Ridgeline Plot

ggplot(data = bs_cc, aes(x = recur_time, y = age_group, 
                       fill = age_group)) +
  geom_density_ridges(alpha = 0.5) +
  guides(fill = "none") +
  labs(title = "Time to Recurrence, by RBC Age Group")

Model Recurrence Time using Age Group

Again, we’re using the complete cases here, from bs_cc.

m1 <- lm(recur_time ~ age_group, data = bs_cc)

m1

Call:
lm(formula = recur_time ~ age_group, data = bs_cc)

Coefficients:
    (Intercept)  age_groupmiddle   age_groupolder  
        34.2885          -3.6143          -0.5193  
  • Equation m1 is: recur_time = 34.29 - 3.61 (age_group = Middle) - 0.52 (age_group = Older)

m1 regression equation?

recur_time = 34.29 - 3.61 (age_group = Middle) - 0.52 (age_group = Older)

age_group m1 estimate of recur_time (months)
Younger
Middle
Older

recur_time = 34.29 - 3.61 (age_group = Middle) - 0.52 (age_group = Older)

age_group m1 estimate of recur_time (months)
Younger 34.29
Middle
Older

recur_time = 34.29 - 3.61 (age_group = Middle) - 0.52 (age_group = Older)

age_group m1 estimate of recur_time (months)
Younger 34.29
Middle 34.29 - 3.61 = 30.68
Older

recur_time = 34.29 - 3.61 (age_group = Middle) - 0.52 (age_group = Older)

age_group m1 estimate of recur_time (months)
Younger 34.29
Middle 34.29 - 3.61 = 30.68
Older 34.29 - 0.52 = 33.77

Sample Means from bs_cc

favstats(recur_time ~ age_group, data = bs_cc) |>
  select(age_group, mean) |>
  kbl(digits = 2) |> kable_styling(font_size = 28)
age_group mean
younger 34.29
middle 30.67
older 33.77

Compare to m1 estimates (some rounding)

age_group Younger Middle Older
Est. recur_time 34.29 30.68 33.77

Tidy coefficients with broom package

tidy(m1, conf.int = TRUE, conf.level = 0.90) |> 
  kbl(digits = 2) |> kable_styling(font_size = 28)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 34.29 2.91 11.78 0.00 29.48 39.09
age_groupmiddle -3.61 4.10 -0.88 0.38 -10.38 3.15
age_groupolder -0.52 4.13 -0.13 0.90 -7.33 6.29
  • What is the 90% CI for the population mean time to recurrence for age_group = Younger?
  • What is the 90% CI for the mean difference in time to recurrence between Younger and Middle?

glance to summarize m1’s fit

glance(m1) |>
  select(r.squared, AIC, sigma, nobs, df, df.residual) |>
  kbl(digits = c(4, 1, 1, 0, 0, 0)) |> kable_styling(font_size = 28)
r.squared AIC sigma nobs df df.residual
0.0032 2762 28.5 289 2 286
  • r.squared = \(R^2\), the proportion of variation in recur_time accounted for by the model using age_group.
    • indicates improvement over predicting mean(recur_time) for everyone
  • sigma = residual standard error
  • AIC index: when comparing models with identical outcomes, lower AIC indicates a better fit

Get Fitted Values and Residuals

m1_aug <- augment(m1, data = bs_cc)

m1_aug |> 
  select(participant, recur_time, age_group, .fitted, .resid) |> 
  head() |> kbl(digits = 2) |> kable_styling(font_size = 28)
participant recur_time age_group .fitted .resid
102 47.63 older 33.77 13.86
103 14.10 older 33.77 -19.67
104 59.47 middle 30.67 28.80
105 1.23 middle 30.67 -29.44
106 74.70 older 33.77 40.93
107 13.87 older 33.77 -19.90

Why I like tidy() and other broom functions

https://github.com/allisonhorst/stats-illustrations

Linear Model Assumptions

  1. Linearity
  2. Homoscedasticity (Constant Variance)
  3. Normality

all checked with residual plots

Linear Model Assumptions?

We assume that:

  1. the regression relationship is linear, rather than curved, and we can assess this by plotting the regression residuals (prediction errors) against the fitted values and looking to see if a curve emerges.
  • Do we see a curve in the plot we draw next?

Plot residuals vs. fitted values from m1

ggplot(m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red",
              formula = y ~ x, se = FALSE) + 
  geom_smooth(method = "loess", col = "blue",
              formula = y ~ x, se = FALSE) +
  labs(title = "m1: Residuals vs. Fitted Values", 
       x = "Fitted sbp values", y = "Residuals")

Plot residuals vs. fitted values from m1

Linear Model Assumptions?

We assume that:

  1. the regression residuals show similar variance across levels of the fitted values, and again we can get insight into this by plotting residuals vs. predicted values.
  • Do we see a fan shape in the plot we draw next?
  • Does the variation change materially as we move from left to right?

Plot residuals vs. fitted values from m1

A Fuzzy Football

  • What we want to see in the plot of residuals vs. fitted values is a “fuzzy football.”

Linear Model Assumptions?

We assume that:

  1. the regression residuals (prediction errors) are well described by a Normal model, and we can assess this with all of our usual visualizations to help decide on whether a Normal model is reasonable for a batch of data.
  • Do the residuals from our model appear to follow a Normal distribution?

Check Normality of m1 residuals

p1 <- ggplot(m1_aug, aes(sample = .resid)) +
  geom_qq(col = "seagreen") + geom_qq_line(col = "black") + 
  theme(aspect.ratio = 1) + 
  labs(title = "Normal Q-Q: 1050 `m1` Residuals")

p2 <- ggplot(m1_aug, aes(x = .resid)) +
  geom_histogram(aes(y = stat(density)), 
                 bins = 20, fill = "seagreen", col = "yellow") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(m1_aug$.resid), 
                            sd = sd(m1_aug$.resid)),
                col = "black", lwd = 1.5) +
  labs(title = "Hist + Normal Density: `m1` Residuals")

p3 <- ggplot(m1_aug, aes(x = .resid, y = "")) +
  geom_boxplot(fill = "seagreen", outlier.color = "seagreen") + 
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +  
  labs(title = "Boxplot: `m1` Residuals", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4,1)))

Check Normality of m1 residuals

Numerical Summary of Residuals?

favstats(~ .resid, data = m1_aug) |>
  kbl(digits = 1) |> kable_styling(font_size = 24)
min Q1 median Q3 max mean sd n missing
-34 -25.5 -6.7 18.5 72.9 0 28.4 289 0

Alternative m1 residual plots?

par(mfrow = c(1,2)); plot(m1, which = 1:2); par(mfrow = c(1,1))

Imputation

Dealing with the Missing Data

We have done all analyses on complete cases, but that’s not always wise.

  • What if doing so would bias our conclusions?
  • Here we have two missing age_group values and one missing recur_time.

It’s scary to estimate these missing values. What could we do?

Single Imputation

In single imputation analyses, NA values are estimated/replaced one time with one particular data value for the purpose of obtaining more complete samples, at the expense of creating some potential bias in the eventual conclusions or obtaining slightly less accurate estimates than would be available if there were no missing values in the data.

  • The simputation package can help us execute single imputations using a wide variety of techniques, within the pipe approach used by the tidyverse.

See Section 9.8 of the Course Notes for some additional examples.

Estimate missing values?

bs_dat |> select(-participant) |> summary()
   age_group      units         recur_time     
 younger:97   Min.   :1.000   Min.   :  0.270  
 middle :98   1st Qu.:2.000   1st Qu.:  7.685  
 older  :95   Median :2.000   Median : 26.690  
 NA's   : 2   Mean   :2.048   Mean   : 33.297  
              3rd Qu.:2.000   3rd Qu.: 52.685  
              Max.   :4.000   Max.   :103.600  
                              NA's   :1        

Which values are missing and must be imputed?

Create an imputation model

The simputation package is our friend here. We’ll use

  • impute_pmm() to impute quantities, and
  • impute_cart() to impute factors, for now.
bs_imp <- bs_dat |>
  impute_pmm(recur_time ~ age_group + units) |>
  impute_cart(age_group ~ units)

We start with no missing units so we use that to impute age_group, then use both age_group and units to impute recur_time. Any missing data now?

Compare Results

summary(bs_dat)
 participant          age_group      units         recur_time     
 Length:292         younger:97   Min.   :1.000   Min.   :  0.270  
 Class :character   middle :98   1st Qu.:2.000   1st Qu.:  7.685  
 Mode  :character   older  :95   Median :2.000   Median : 26.690  
                    NA's   : 2   Mean   :2.048   Mean   : 33.297  
                                 3rd Qu.:2.000   3rd Qu.: 52.685  
                                 Max.   :4.000   Max.   :103.600  
                                                 NA's   :1        
summary(bs_imp)
 participant          age_group      units         recur_time     
 Length:292         younger:98   Min.   :1.000   Min.   :  0.270  
 Class :character   middle :98   1st Qu.:2.000   1st Qu.:  7.728  
 Mode  :character   older  :96   Median :2.000   Median : 26.695  
                                 Mean   :2.048   Mean   : 33.301  
                                 3rd Qu.:2.000   3rd Qu.: 52.492  
                                 Max.   :4.000   Max.   :103.600  

Model Time Using Age with bs_imp

m1_imp <- lm(recur_time ~ age_group, data = bs_imp)

m1_imp

Call:
lm(formula = recur_time ~ age_group, data = bs_imp)

Coefficients:
    (Intercept)  age_groupmiddle   age_groupolder  
         34.855           -4.180           -0.457  

Compare Tidied Coefficients

tidy(m1, conf.int = TRUE, conf.level = 0.90) |> 
  kbl(digits = 2) |> kable_styling(font_size = 28)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 34.29 2.91 11.78 0.00 29.48 39.09
age_groupmiddle -3.61 4.10 -0.88 0.38 -10.38 3.15
age_groupolder -0.52 4.13 -0.13 0.90 -7.33 6.29
tidy(m1_imp, conf.int = TRUE, conf.level = 0.90) |> 
  kbl(digits = 2) |> kable_styling(font_size = 28)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 34.85 2.91 11.99 0.00 30.06 39.65
age_groupmiddle -4.18 4.11 -1.02 0.31 -10.97 2.60
age_groupolder -0.46 4.13 -0.11 0.91 -7.28 6.36

Compare Summaries with glance

glance(m1) |>
  select(r.squared, AIC, sigma, nobs, df, df.residual) |>
  kbl(digits = c(4, 1, 1, 0, 0, 0)) |> kable_styling(font_size = 28)
r.squared AIC sigma nobs df df.residual
0.0032 2762 28.5 289 2 286
glance(m1_imp) |>
  select(r.squared, AIC, sigma, nobs, df, df.residual) |>
  kbl(digits = c(4, 1, 1, 0, 0, 0)) |> kable_styling(font_size = 28)
r.squared AIC sigma nobs df df.residual
0.0043 2795.7 28.8 292 2 289

But these two models do not have the same outcomes. Why not? What does this mean about comparing the glance results?

Three Types of Missingness

  1. MCAR = Missingness completely at random.

A variable is missing completely at random if the probability of missingness is the same for all units, for example, if for each subject, we decide whether to collect data on a measure by rolling a die and refusing to answer if a “6” shows up. If data are missing completely at random, then throwing out cases with missing data (i.e. doing a complete case analysis) does not bias your inferences.

Three Types of Missingness

  1. MAR = Missingness at random.

Missingness that depends only on observed predictors. A more general assumption, called missing at random or MAR, is that the probability a variable is missing depends only on available information. Here, we would have to be willing to assume that the probability of nonresponse to depends only on the other, fully recorded variables in the data.

  • Here is the situation that most obviously cries out for imputation.

Three Types of Missingness

  1. Missing not at random

This is a bigger problem, and includes both:

  • Missingness that depends on unobserved predictors. Missingness is no longer “at random” if it depends on information that has not been recorded and this information also predicts the missing values.
  • Missingness that depends on the missing value itself. For example, suppose that people with higher earnings are less likely to reveal them.

OK, back to working with complete cases

Back to our Comparison Boxplot

  • Does comparing means make sense here?
  • Are these sample distributions “Normal-ish”?

Would a Transformation Help Us?

favstats(~ recur_time, data = bs_cc)
  min  Q1 median    Q3   max     mean       sd   n missing
 0.27 7.6   25.3 52.07 103.6 32.89225 28.47644 289       0

Since all recur_time values are positive, we might look at:

\(log(time)\), or \(1/time\), or \(\sqrt{time}\), or \(time^2\), for example…

What are we hoping these transformations will do?

Boxplot 0: recur_time by age_group

Boxplot 1: log(recur_time) by age_group

Boxplot 2: 1/(recur_time) by age_group

Boxplot 3: \(\sqrt{time}\) by age_group

Code for Boxplot 3

ggplot(data = bs_cc, aes(x = age_group, y = sqrt(recur_time))) +
  geom_violin() +
  geom_boxplot(aes(fill = age_group), width = 0.3, 
               notch = TRUE, outlier.size = 2) +
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +
  guides(fill = "none") +
  coord_flip() +
  scale_fill_viridis_d(alpha = 0.5) +
  labs(y = "Square Root of Recurrence Time",
       x = "Red Blood Cell age group",
       title = "Square Root of Recurrence Time by RBC Age Group",
       caption = "Diamonds indicate sample means")

Ridgeline Plot for \(\sqrt{time}\)?

ggplot(data = bs_cc, aes(x = sqrt(recur_time), y = age_group, 
                       fill = age_group)) +
  geom_density_ridges(alpha = 0.5) +
  guides(fill = "none") +
  labs(title = "Square Root of Time to Recurrence, by RBC Age Group")

Fit a Model to predict \(\sqrt{time}\)?

m2 <- lm(sqrt(recur_time) ~ age_group, data = bs_cc)

m2

Call:
lm(formula = sqrt(recur_time) ~ age_group, data = bs_cc)

Coefficients:
    (Intercept)  age_groupmiddle   age_groupolder  
        5.17036         -0.29923          0.01355  

Predicted Values using m2

sqrt(recur_time) = 5.17 - 0.299 (age_group = Middle) + 0.014 (age_group = Older)

age_group Est. \(\sqrt{time}\) Est. recur_time
Younger 5.17 ?
Middle 5.17 - 0.299 = 4.871 ?
Older ? ?

Predicted recur_time using m2

sqrt(recur_time) = 5.17 - 0.299 (age_group = Middle) + 0.014 (age_group = Older)

age_group Est. \(\sqrt{time}\) Est. recur_time
Younger 5.17 26.73
Middle 5.17 - 0.299 = 4.871 23.73
Older 5.17 + 0.014 = 5.184 26.87

Tidy model m2

tidy(m2, conf.int = TRUE, conf.level = 0.90) |> 
  kbl(digits = 2) |> kable_styling(font_size = 28)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 5.17 0.27 18.87 0.00 4.72 5.62
age_groupmiddle -0.30 0.39 -0.78 0.44 -0.94 0.34
age_groupolder 0.01 0.39 0.03 0.97 -0.63 0.65

glance to summarize m2’s fit

glance(m2) |>
  select(r.squared, AIC, sigma, nobs, df, df.residual) |>
  kbl(digits = c(4, 1, 1, 0, 0, 0)) |> kable_styling(font_size = 28)
r.squared AIC sigma nobs df df.residual
0.0029 1395.9 2.7 289 2 286

m2: first two residual plots

par(mfrow = c(1,2)); plot(m2, which = 1:2); par(mfrow = c(1,1))

An Example to Work through on your own

Predict time with units

Some data prep required:

  • units is actually a count.
  • Use all 291 observations with recur_time and units.
bs_dat2 <- bs_dat |>
  filter(complete.cases(recur_time, units))

bs_dat2 |> tabyl(units)
 units   n   percent
     1  67 0.2302405
     2 174 0.5979381
     3  19 0.0652921
     4  31 0.1065292

Scatterplot of recur_time vs. age_group

ggplot(bs_dat2, aes(x = age_group, y = recur_time)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)

Comparison Boxplot

ggplot(data = bs_dat2, aes(x = factor(units), y = recur_time)) +
  geom_violin() +
  geom_boxplot(aes(fill = factor(units)), width = 0.3, 
               outlier.size = 2) +
  stat_summary(fun = "mean", geom = "point", 
               shape = 23, size = 3, fill = "white") +
  guides(fill = "none") +
  coord_flip() +
  scale_fill_viridis_d(alpha = 0.5) +
  labs(y = "Recurrence Time (in months)",
       x = "Number of Units Received",
       title = "Recurrence Time by Units",
       caption = "Diamonds indicate sample means")

Comparison Boxplot

Model Time using Units

m3 <- lm(recur_time ~ units, data = bs_dat2)

tidy(m3, conf.int = TRUE, conf.level = 0.90)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    37.5       4.41      8.50 1.06e-15    30.2      44.8 
2 units          -2.04      1.99     -1.03 3.06e- 1    -5.32      1.24

Model Square Root of Time using Units

m4 <- lm(sqrt(recur_time) ~ units, data = bs_dat2)

tidy(m4, conf.int = TRUE, conf.level = 0.90)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    5.54      0.413     13.4  3.33e-32    4.85     6.22  
2 units         -0.211     0.186     -1.13 2.59e- 1   -0.518    0.0967

Normal Q-Q plots of Residuals

m3_aug <- augment(m3, data = bs_dat2)
m4_aug <- augment(m4, data = bs_dat2)

p1 <- ggplot(m3_aug, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(title = "Model m3 Residuals", x = "", y = "")

p2 <- ggplot(m4_aug, aes(sample = .resid)) +
  geom_qq() + geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(title = "Model m4 Residuals", x = "", y = "")

p1 + p2

Normal Q-Q plots of Residuals

m3 residual plots

par(mfrow = c(1,2)); plot(m3, which = 1:2); par(mfrow = c(1,1))

m4 residual plots

par(mfrow = c(1,2)); plot(m4, which = 1:2); par(mfrow = c(1,1))

Compare fits of m1 and m3?

glance(m1) |> select(r.squared, AIC, sigma, df, df.residual, nobs)
# A tibble: 1 × 6
  r.squared   AIC sigma    df df.residual  nobs
      <dbl> <dbl> <dbl> <dbl>       <int> <int>
1   0.00318 2762.  28.5     2         286   289
glance(m3) |> select(r.squared, AIC, sigma, df, df.residual, nobs)
# A tibble: 1 × 6
  r.squared   AIC sigma    df df.residual  nobs
      <dbl> <dbl> <dbl> <dbl>       <int> <int>
1   0.00362 2785.  28.8     1         289   291

Are these two models actually predicting the same outcome?

  • for the same subjects?

Session Information

sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2   forcats_1.0.0     stringr_1.5.0     purrr_1.0.2      
 [5] readr_2.1.4       tidyr_1.3.0       tibble_3.2.1      tidyverse_2.0.0  
 [9] patchwork_1.1.3   simputation_0.2.8 naniar_1.0.0      mosaic_1.8.4.2   
[13] mosaicData_0.20.3 ggformula_0.10.4  dplyr_1.1.2       Matrix_1.6-1     
[17] ggplot2_3.4.3     lattice_0.21-8    janitor_2.2.0     kableExtra_1.3.4 
[21] ggridges_0.5.4    broom_1.0.5      

loaded via a namespace (and not attached):
 [1] gtable_0.3.4       xfun_0.40          visdat_0.6.0       tzdb_0.4.0        
 [5] vctrs_0.6.3        tools_4.3.1        generics_0.1.3     fansi_1.0.4       
 [9] highr_0.10         pkgconfig_2.0.3    webshot_0.5.5      lifecycle_1.0.3   
[13] compiler_4.3.1     farver_2.1.1       munsell_0.5.0      ggforce_0.4.1     
[17] ggstance_0.3.6     snakecase_0.11.1   htmltools_0.5.6    yaml_2.3.7        
[21] pillar_1.9.0       ellipsis_0.3.2     MASS_7.3-60        gower_1.0.1       
[25] rpart_4.1.19       nlme_3.1-162       mosaicCore_0.9.2.1 tidyselect_1.2.0  
[29] rvest_1.0.3        digest_0.6.33      stringi_1.7.12     splines_4.3.1     
[33] labeling_0.4.3     polyclip_1.10-6    labelled_2.12.0    fastmap_1.1.1     
[37] grid_4.3.1         colorspace_2.1-0   cli_3.6.1          magrittr_2.0.3    
[41] utf8_1.2.3         withr_2.5.1        scales_1.2.1       backports_1.4.1   
[45] timechange_0.2.0   rmarkdown_2.25     httr_1.4.7         hms_1.1.3         
[49] evaluate_0.21      knitr_1.44         haven_2.5.3        viridisLite_0.4.2 
[53] mgcv_1.8-42        rlang_1.1.1        Rcpp_1.0.11        glue_1.6.2        
[57] tweenr_2.0.2       xml2_1.3.5         svglite_2.1.1      rstudioapi_0.15.0 
[61] jsonlite_1.8.7     R6_2.5.1           systemfonts_1.0.4